Inverse Reinforcement Learning with the Average Reward Criterion

Neural Information Processing Systems

We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion. The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent. Previous IRL methods assume that the expert is trained in a discounted environment, and that the discount factor is known. We develop novel stochastic first-order methods to solve the IRL problem under the average-reward setting, which requires solving an Average-reward Markov Decision Process (AMDP) as a subproblem. To solve the subproblem, we develop a Stochastic Policy Mirror Descent (SPMD) method under general state and action spaces that needs $\mathcal{O}(1/\varepsilon)$ steps of gradient computation.
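At its core, a policy mirror descent step in KL geometry is a multiplicative update, $\pi_{k+1}(a|s) \propto \pi_k(a|s)\exp(\eta\, Q_k(s,a))$. The sketch below is only a tabular, deterministic illustration of that update rule, not the paper's SPMD method (which operates over general state and action spaces with stochastic gradient estimates); the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def spmd_update(policy, Q, eta):
    """One tabular mirror-descent policy step in KL geometry:
    pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q(s, a)).
    policy: (S, A) array of action probabilities; Q: (S, A) array."""
    logits = np.log(policy) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)  # renormalize each state's row
    return new_policy
```

Working in log space and subtracting the row maximum avoids overflow when `eta * Q` is large; the update leaves probability mass on every action, shifting it toward higher-value actions at a rate controlled by the step size `eta`.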


RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Hisaki, Yukinari, Ono, Isao

arXiv.org Artificial Intelligence

[...] learning (DRL) method utilizing the average reward criterion. While most existing DRL methods employ the discounted reward criterion, this can potentially lead to a discrepancy between the training objective and performance metrics in continuing tasks, making the average reward criterion a recommended alternative. We introduce RVI-SAC, an extension of the state-of-the-art off-policy DRL method, Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b), to the average reward criterion. Our proposal consists of (1) Critic updates based on RVI Q-learning (Abounadi et al., 2001), (2) Actor updates introduced by the average reward soft policy improvement theorem, and (3) automatic adjustment of Reset Cost enabling [...]

These methods utilize the discounted reward criterion, which is applicable to a variety of MDP-formulated tasks (Puterman, 1994). In particular, for continuing tasks where there is no natural breakpoint in episodes, such as in robot locomotion (Todorov et al., 2012) or Access Control Queuing Tasks (Sutton & Barto, 2018), where the interaction between an agent and an environment can continue indefinitely, the discount rate plays a role in keeping the infinite-horizon return bounded. However, discounting introduces an undesirable effect in continuing tasks by prioritizing rewards closer to the current time over those in the future. An approach to mitigate this effect is to bring the discount rate closer to 1, but it is commonly known that a large discount rate can lead to instability and slower convergence (Fujimoto et al., 2018; Dewanto & Gallagher, 2021).
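The critic update the abstract refers to builds on RVI (Relative Value Iteration) Q-learning, where a reference offset $f(Q)$ stands in for the unknown average reward and is subtracted from the TD target so that Q-values stay bounded without a discount factor. Below is a minimal tabular sketch of that idea, not the RVI-SAC critic itself (which is a deep, soft-value variant); the function name, the choice $f(Q) = Q(s_0, a_0)$ at a fixed reference pair, and the array layout are assumptions for illustration.

```python
import numpy as np

def rvi_q_step(Q, s, a, r, s_next, alpha, ref=(0, 0)):
    """One tabular RVI Q-learning update:
        Q(s,a) <- Q(s,a) + alpha * (r - f(Q) + max_a' Q(s',a') - Q(s,a))
    where f(Q) = Q[ref] (the value at a fixed reference state-action pair)
    acts as an estimate of the average reward, keeping Q bounded
    under the average reward criterion. Q is an (S, A) array, updated in place."""
    f_Q = Q[ref]                                  # reference offset (average-reward proxy)
    td_error = r - f_Q + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```

Because the same offset is subtracted from every update, only relative values are learned, which is exactly what is needed for greedy action selection in the average-reward setting.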